Team Members

This report is for Team A, which includes:

Introduction

The United States forest Service (UFS) considers itself a multi-faceted agency. Cumulatively, the agency manages and protects 154 national forests and 20 grasslands in 43 states and Puerto Rico including approximately 500 million acres of private, state and tribal forests, on which the UFS promotes sustainable management. The UFS mission is to sustain the health, diversity, and productivity of the nation’s forests and grasslands to meet the needs of present and future generations. Some highlights of the UFS responsibility are shown below:

o 13 billion dollars contributed to the U.S. economy by visitor spending each year o 193 million acres managed by the Forest Service o 27 million annual visits to ski areas on national forests o 7.2 million acres of wetlands o 36.6 million acres of wilderness o 400,000 acres of lakes o 57,000 miles of streams o 10,000 professional wildland firefighters o 154 national forests o 20 percent of America’s clean water supply provided by the national forests and grasslands Source: https://www.fs.fed.us/about-agency/newsroom/by-the-numbers

Like any managerial process, the UFS’s success is measured in terms of how well it maximizes its use of resources (time, people, and money) to deliver value to their customers, in this case the US taxpayer/government. To maximize their impact, the UFS has had success creating partnerships with public and private agencies that help the UFS plant trees, improve trails, educate the public, and promote sustainable forest management and biodiversity conservation domestically and internationally.

Given the large area, various climates and terrains, diverse plants and forest species, and wildlife that make surveying the locations on any kind of frequent basis, the Forest Gladiator team has been tasked with developing a predictive model(s) using existing data to improve the UFS’s management capabilities involving forest cover areas.

The Modeling Problem

There is a need to predict the forest cover for 30 x 30 meter cells using cartographic variables obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. The objective is to predict cover type as a multiclass classification problem based on the associated attributes (features).

The Data: Data Inventory and Data Quality Check

In our raw data it is clear that we don’t have an even distribuion of the target variable (cover types). The data set is over 85% Spruce Fir and Lodgepole Pine. Another nearly 10% are Ponerosa Pine and Krummholz. This needs to be taken into account as we split the data for training and testing. It also gives us a good benchmark for our models as we check the outputs on the test data.

From a quick skim of the data we see it is already pretty clean. There are no missing data and nothing jumps out as obviously out of the ordinary. There are clear groups of variables that we will explore for patterns such as soil type, distance measures, shade at different times of day, and wilderness area.

set.seed(10)

train_index <- caret::createDataPartition(
  y = raw_data$cover_type,
  p = 0.7,
  times = 1,
  list = FALSE
)

raw_train <- raw_data[train_index,]
raw_test  <- raw_data[-train_index,]

skim(raw_train) %>% pander() 

Skim summary statistics
n obs: 406710
n variables: 55

Table continues below
variable missing complete n n_unique
cover_type 0 406710 406710 7
top_counts ordered
lod: 198311, spr: 148288, pon: 25028, kru: 14357 FALSE
Table continues below
variable missing complete n mean
aspect 0 406710 406710 155.52
elevation 0 406710 406710 2959.23
hillshade_3pm 0 406710 406710 142.49
hillshade_9am 0 406710 406710 212.16
hillshade_noon 0 406710 406710 223.31
horizontal_distance_to_fire_points 0 406710 406710 1981.43
horizontal_distance_to_hydrology 0 406710 406710 269.17
horizontal_distance_to_roadways 0 406710 406710 2349.75
slope 0 406710 406710 14.11
soil_type1 0 406710 406710 0.0052
soil_type10 0 406710 406710 0.056
soil_type11 0 406710 406710 0.021
soil_type12 0 406710 406710 0.052
soil_type13 0 406710 406710 0.03
soil_type14 0 406710 406710 0.001
soil_type15 0 406710 406710 7.4e-06
soil_type16 0 406710 406710 0.0049
soil_type17 0 406710 406710 0.0059
soil_type18 0 406710 406710 0.0033
soil_type19 0 406710 406710 0.007
soil_type2 0 406710 406710 0.013
soil_type20 0 406710 406710 0.016
soil_type21 0 406710 406710 0.0014
soil_type22 0 406710 406710 0.058
soil_type23 0 406710 406710 0.099
soil_type24 0 406710 406710 0.036
soil_type25 0 406710 406710 0.00083
soil_type26 0 406710 406710 0.0044
soil_type27 0 406710 406710 0.0019
soil_type28 0 406710 406710 0.0016
soil_type29 0 406710 406710 0.2
soil_type3 0 406710 406710 0.0083
soil_type30 0 406710 406710 0.052
soil_type31 0 406710 406710 0.044
soil_type32 0 406710 406710 0.09
soil_type33 0 406710 406710 0.077
soil_type34 0 406710 406710 0.0027
soil_type35 0 406710 406710 0.0033
soil_type36 0 406710 406710 0.00017
soil_type37 0 406710 406710 0.00052
soil_type38 0 406710 406710 0.027
soil_type39 0 406710 406710 0.024
soil_type4 0 406710 406710 0.021
soil_type40 0 406710 406710 0.015
soil_type5 0 406710 406710 0.0028
soil_type6 0 406710 406710 0.011
soil_type7 0 406710 406710 0.00017
soil_type8 0 406710 406710 0.00031
soil_type9 0 406710 406710 0.0021
vertical_distance_to_hydrology 0 406710 406710 46.39
wilderness_area1 0 406710 406710 0.45
wilderness_area2 0 406710 406710 0.052
wilderness_area3 0 406710 406710 0.44
wilderness_area4 0 406710 406710 0.064
sd p0 p25 p50 p75 p100 hist
111.9 0 58 127 260 360 ▇▇▆▃▃▃▃▆
280.02 1859 2809 2995 3163 3858 ▁▁▂▃▇▆▁▁
38.3 0 119 143 168 254 ▁▁▂▆▇▆▂▁
26.76 0 198 218 231 254 ▁▁▁▁▁▂▇▇
19.78 0 213 226 237 254 ▁▁▁▁▁▁▅▇
1324.59 0 1024 1711 2551 7173 ▅▇▆▂▁▁▁▁
212.33 0 108 218 384 1397 ▇▆▃▂▁▁▁▁
1558.92 0 1103 1998 3326 7117 ▆▇▆▅▃▂▂▁
7.5 0 9 13 18 66 ▅▇▅▂▁▁▁▁
0.072 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.23 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.14 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.22 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.17 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.032 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.0027 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.07 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.077 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.057 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.083 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.11 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.13 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.037 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.23 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.3 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.19 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.029 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.066 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.043 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.041 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.4 0 0 0 0 1 ▇▁▁▁▁▁▁▂
0.091 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.22 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.21 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.29 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.27 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.052 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.057 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.013 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.023 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.16 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.15 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.14 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.12 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.052 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.11 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.013 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.018 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.045 0 0 0 0 1 ▇▁▁▁▁▁▁▁
58.33 -166 7 29 69 601 ▁▇▇▂▁▁▁▁
0.5 0 0 0 1 1 ▇▁▁▁▁▁▁▆
0.22 0 0 0 0 1 ▇▁▁▁▁▁▁▁
0.5 0 0 0 1 1 ▇▁▁▁▁▁▁▆
0.24 0 0 0 0 1 ▇▁▁▁▁▁▁▁

If we set aside the larger groups of variables, we get three other variables to look at: aspect, elevation, and slope. A quick look shows that elevation may help separate the cover types. There may also be interesting ways to combine these variables.

raw_train %>% 
  dplyr::select(cover_type, aspect, elevation, slope) %>% 
  gather(-cover_type, key = key, value = val) %>% 
  ggplot(aes(cover_type, val, fill = cover_type)) +
  geom_boxplot() +
  coord_flip() +
  facet_grid(~key, scales = "free") +
  theme(legend.position="none")

A density plot of elevation shows there is overlap but the groups are reasonably separated. This will likely be an important variable for our models.

raw_train %>% 
  ggplot(aes(x=elevation)) + 
  geom_density(aes(group=cover_type, color=cover_type, fill=cover_type), alpha=0.3)

EDA: Initial exploratory data analysis.

The first group of variables are the 40 soil types. By plotting the proportion of each cover type contains each soil type we get some obvious clusters. Likely in modeling we will want to group the soil types. Initial insights show that Krumholz is mostly in soil types 30 through 40. Ponderosa Pine and Cottonwood Willow are 0 through 20.

Soil types present in a cell can help eliminate the options for cover type in our model.

train_soil <- raw_train %>%
  dplyr::select(cover_type, contains("soil")) %>%
  gather(-cover_type, key = soil, value = val) %>%
  mutate(soil = parse_number(soil), n=1) 

train_soil %>% 
  group_by(cover_type, soil) %>%
  summarise(total = sum(val),
            prop  = total / sum(n)) %>% 
  ggplot(aes(soil, prop, fill = cover_type)) + 
  geom_col() +
  facet_grid(~cover_type) +
  theme(axis.text.x = element_text(size = 6, angle=45),
        legend.position="none") +
  coord_flip()

Looking further into the soil types we can see that cover types usually have mostly soil types over or under 20, not both, except for aspen which is spread across the spectrum.

train_soil %>% 
  dplyr::filter(val > 0) %>% 
  ggplot(aes(cover_type, val, fill = soil)) +
  geom_col(position="fill") +
  coord_flip()

The next grouping of variables is wilderness area. Again we see some clear delineations aong the cover types. Cottonwood Willow is only in area four. Ponderosa Pine and Douglasfir have three and four, whereas the reaining cover types are primarily three and one.

train_wild <- raw_train %>%
  dplyr::select(cover_type, contains("wilderness")) %>%
  gather(-cover_type, key = wild, value = val) %>%
  mutate(wild = parse_number(wild), n=1) 

train_wild %>% 
  group_by(cover_type, wild) %>%
  summarise(total = sum(val),
            prop  = total / sum(n)) %>% 
  ggplot(aes(wild, prop, fill = cover_type)) + 
  geom_col() +
  facet_grid(~cover_type) +
  theme(axis.text.x = element_text(size = 6, angle=45),
        legend.position="none") +
  coord_flip()

Another view of the proportions confirms this.

train_wild %>% 
  dplyr::filter(val > 0) %>% 
  ggplot(aes(cover_type, val, fill = wild)) +
  geom_col(position="fill") +
  coord_flip()

Interestingly wilderness areas three and one are negatively correlated. This may be useful for our models.

raw_train %>% 
  dplyr::select(contains("wilderness")) %>% 
  cor() %>% 
  melt() %>% 
  mutate(v1 = parse_number(Var1),
         v2 = parse_number(Var2)) %>% 
  ggplot(aes(v1, v2, fill = value)) +
  geom_tile()

Combining soil types and wilderness areas into one plot gives us some obvious clusters in the training data. Further investigation of how to best group these variables will be useful for the modeling phase.

raw_train %>% 
  dplyr::select(cover_type, contains("soil"), contains("wilderness")) %>% 
  mutate(n=1) %>% 
  gather(contains("soil"), key = soil, value = soil_val) %>%
  gather(contains("wilderness"), key = wild, value = wild_val) %>%
  dplyr::filter(soil_val > 0, wild_val > 0) %>% 
  group_by(cover_type, soil, wild) %>% 
  summarise(n = sum(n)) %>% 
  ungroup() %>% 
  group_by(cover_type) %>% 
  mutate(prop = n / sum(n)) %>%
  mutate(soil = parse_number(soil), 
         wild = parse_number(wild)) %>% 
  ungroup() %>% 
  ggplot(aes(wild, soil, color=cover_type, size = prop)) + 
  geom_point() +
  theme(legend.position="none") +
  facet_grid(~cover_type) 

Shade measurements at various times of day show that some cover types have more variation than others. The pines and Spruce Fir are more likely to have lower shade values (<100) than the other cover types.

raw_train %>% 
  dplyr::select(cover_type, contains("hillshade")) %>% 
  gather(-cover_type, key = shade, value = val) %>% 
  mutate(shade = factor(shade, levels = c("hillshade_9am", "hillshade_noon", "hillshade_3pm"))) %>% 
  ggplot(aes(cover_type, val)) +
  geom_jitter(aes(color = shade), alpha = 0.05, stroke = 0) +
  geom_boxplot(aes(fill = shade)) +
  coord_flip() +
  facet_grid(~shade) +
  theme(legend.position="none")

Similarly, the distance variables have cutoffs that can help identify a cover type. Lodgepole Pine and Spruce Fir have much wider ranges f distance measurements than the other cover types.

raw_train %>% 
  dplyr::select(cover_type, contains("distance")) %>% 
  gather(-cover_type, key = dist, value = val) %>% 
  ggplot(aes(cover_type, val)) +
  geom_jitter(aes(color = dist), alpha = 0.05, stroke = 0) +
  geom_boxplot(aes(fill = dist)) +
  coord_flip() +
  facet_grid(~dist, scales = "free") +
  theme(legend.position="none")

Looking at the horizontal and vertical distance to hydrology, most cover types have similar, positive correlations. Combining these variables may further help in distinguishing our target variable.

raw_train %>% 
  dplyr::select(cover_type, contains("hydrology")) %>% 
  ggplot(aes(horizontal_distance_to_hydrology, vertical_distance_to_hydrology)) +
  geom_point(aes(color = cover_type), alpha = 0.01, stroke = 0) +
  geom_smooth() +
  facet_grid(~cover_type) +
  theme(legend.position="none")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Predictive Modeling: Methods and Results: Initial modeling results.

A preliminary Random forest model was constructed to gain a deeper understanding of the relationships between variables and predictors.

## Create a data.table from data.frame
#cover_data <- as.data.table(raw_train)
#rm(raw_train)
##run model in parallel
#cl <- makeCluster(detectCores())
#registerDoParallel(cl)
## Run a base forest model for EDA 
#base_forest <- randomForest(cover_type ~ ., data=cover_data, mtry=sqrt(ncol(cover_data)), 
#                    ntree = 300, importance =T, do.trace=50)
#
#stopCluster(cl)
#base_forest
#saveRDS(base_forest, file = '../cache/base_forest.RData')
base_forest <- readRDS('../cache/base_forest.RData')
base_forest
## 
## Call:
##  randomForest(formula = cover_type ~ ., data = cover_data, mtry = sqrt(ncol(cover_data)),      ntree = 300, importance = T, do.trace = 50) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 16.12%
## Confusion matrix:
##                   spruce_fir lodgepole_pine ponderosa_pine
## spruce_fir            122349          25223             63
## lodgepole_pine         18407         177880           1714
## ponderosa_pine             2           1774          22919
## cottonwood_Willow          0              0            831
## aspen                    332           4642            273
## douglasfir                14           2276           5149
## krummholz               3399             83              2
##                   cottonwood_Willow aspen douglasfir krummholz class.error
## spruce_fir                        0     2         10       641  0.17492312
## lodgepole_pine                    5    79        151        75  0.10302505
## ponderosa_pine                   64     3        266         0  0.08426562
## cottonwood_Willow              1085     0          7         0  0.43577743
## aspen                             0  1388         11         0  0.79115257
## douglasfir                       55     0       4663         0  0.61643498
## krummholz                         0     0          0     10873  0.24266908

The variable importnace plot builds upon what is seen in the EDA, and indicates what features should be pursued further. The plot on the left indicates which variables’ inclusion reduces the classification error rate. The greater the MDA (Mean Decrease in Accuracy) the more important this variable is. The plot on the right indicates which variables contribute to the highest homogeneity in the nodes. The greater the mean decrease in Gini the higher the variables contribution to node purity.

varImpPlot(base_forest)

Next Steps: List of statistical techniques that you will pursue.

Appendix